Unknown Attribute Values in Induction
ABSTRACT
Simple techniques for the development and use of decision tree classifiers assume that all attribute values of all cases are available. Numerous approaches have been proposed with the aim of extending these techniques to cover real-world situations in which unknown attribute values are not uncommon. This paper compares the effectiveness of several approaches as measured by their performance on a collection of datasets.

INTRODUCTION

The standard technique for constructing a decision tree classifier from a training set of cases with known classes, each described in terms of fixed attributes, can be summarised as follows. If all training cases belong to a single class, the tree is a leaf labelled with that class. Otherwise, select a test based on one attribute with mutually exclusive outcomes, divide the training set into subsets, each corresponding to one outcome, and apply the same procedure to each subset.

Once constructed, such a decision tree can be used to classify a new, unseen case described in terms of the same attributes. We start with the root of the tree. If the current node is a leaf, the case is assigned to the class associated with that leaf. Otherwise, the outcome of the test at the current node is determined and we follow the corresponding branch of the tree.

In real-world applications it is not unusual to encounter cases some of whose attribute values are not known. This causes three problems for the procedures sketched above:

- The selection of a test to partition the training set may require comparison of tests based on attributes with different numbers of unknown values. How should this comparison be made in a sensible manner?
- Once a test has been selected, based on attribute A say, training cases with unknown values of A cannot be associated with any one outcome of the test. How should these cases be treated in the division of the training set into subsets?
- When the decision tree is used to classify an unseen case, how should we proceed when we encounter a test on an attribute whose value is not known?

This paper evaluates several methods of circumventing these problems through controlled experiments on small variations of the same system (Buchanan). We start with a description of the datasets used in the trials.
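As a point of reference before unknown values enter the picture, the minimal sketch below (in Python, which is not used in the paper) spells out the construction and classification procedures just summarised. The names build_tree, classify, and select_test, the dictionary-based tree representation, and the toy test-selection heuristic are all assumptions of this sketch; the program evaluated later in the paper selects tests with the gain ratio criterion.

```python
from collections import Counter

def select_test(cases, attributes):
    """Placeholder heuristic: pick the attribute with the most distinct values.
    The paper's program uses the gain ratio criterion instead."""
    return max(attributes, key=lambda a: len({values[a] for values, _ in cases}))

def build_tree(cases, attributes):
    """Recursively build a decision tree from (attribute-dict, class) pairs."""
    classes = [cls for _, cls in cases]
    # If all training cases belong to a single class (or no attributes remain
    # to be tested), the tree is a leaf labelled with the majority class.
    if len(set(classes)) == 1 or not attributes:
        return {"leaf": Counter(classes).most_common(1)[0][0]}

    # Otherwise select a test on one attribute, divide the training set into
    # subsets (one per outcome) and apply the same procedure to each subset.
    attr = select_test(cases, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    for outcome in {values[attr] for values, _ in cases}:
        subset = [(v, c) for v, c in cases if v[attr] == outcome]
        branches[outcome] = build_tree(subset, remaining)
    return {"test": attr, "branches": branches}

def classify(tree, case):
    """Follow the tree from the root to a leaf; assumes every tested
    attribute value of `case` is known."""
    while "leaf" not in tree:
        tree = tree["branches"][case[tree["test"]]]
    return tree["leaf"]

# Toy example with two attributes.
training = [
    ({"colour": "red", "size": "big"}, "yes"),
    ({"colour": "red", "size": "small"}, "no"),
    ({"colour": "blue", "size": "big"}, "no"),
]
tree = build_tree(training, ["colour", "size"])
print(classify(tree, {"colour": "red", "size": "big"}))   # -> "yes"
```

Both functions silently assume complete data: classify raises an error if a tested value is missing, which is exactly the gap the approaches below are designed to fill.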
DESCRIPTION OF DATASETS

In the following, the unknown rate of some attribute over a set of cases means the proportion of those cases whose value of that attribute is unknown. To apply an unknown rate of x to some attribute, we examine each case in the set and, with probability x, replace the value of the attribute with "unknown".

Breiman et al. report experiments with a system called CART. One domain involved recognising digits on a faulty seven-element LED display, each element of which has some probability of having the wrong on/off status. Using a fixed training set, Breiman et al. observed the effect on CART's accuracy of applying various unknown rates to all seven attributes. The first dataset consists of their training set and a randomly generated test set, with an unknown rate applied to all attributes. Breiman et al. found that, for this induction task, the reduced accuracy was almost entirely due to unknown values in the test set.

Two further datasets for this domain were derived from the above so as to highlight the effect of unknown values on the tree-construction process, especially when attributes have different unknown rates. The step variant uses a separate training set and applies an unknown rate to only four of the attributes. The slope variant uses the same number of training cases, but applies unknown rates that vary steadily from the first attribute through the second to the last.

The fourth dataset is a corrupted version of a chess endgame domain. There are two classes and binary-valued attributes, with an unknown rate applied to half of them, and separate training and test sets.

The remaining datasets are all from real-world domains in which unknown values occur frequently. The location-of-primary-tumor data has many classes and attributes, two of which contain unknown values at appreciable rates, with separate training and test cases. The sick-euthyroid dataset is a stratified sample from a thyroid assay domain in which the five key hormone measurements have been categorised as high, normal, or low, and have varying unknown rates; it also has separate training and test cases. The auto insurance data has a moderate level of unknown values in some of its attributes, again with separate training and test cases.

For each dataset, ten training and test sets were generated, either by reapplying unknown rates (the LED datasets) or by randomly dividing the available data into training and test sets (the others).

DESCRIPTION OF APPROACHES

All the approaches described here were implemented as variants of a single tree-building program that uses gain ratio, an information-based heuristic, to select tests (Quinlan). The trees produced in these experiments were not pruned (Quinlan b). Several methods of overcoming the three problems have been explored. Each of them has an identifying letter, so that a package can be described succinctly by three letters denoting its approach to each of the three problems.

When evaluating a test based on attribute A:
- i: Ignore cases in the training set with unknown values of A (Friedman; Breiman et al.).
- r: Reduce the apparent information gain from testing A by the proportion of cases with unknown values of A. The rationale for this reduction is that, if A has an unknown rate of x, testing A will yield no information x of the time.
- s: Fill in the missing values of A before calculating the gain of A (Shapiro). Shapiro's method builds a decision tree for each attribute that attempts to determine a case's value of that attribute in terms of the values of other attributes (Quinlan). The method of surrogate splits (Breiman et al.) may be viewed as a special case of this approach.
- c: Similarly, fill in unknown values of A with its most common known value before calculating gain (Clark and Niblett).

When partitioning the training set using a test on attribute A, and a training case has an unknown value of A:
- i: Ignore this case (Quinlan).
- s: Determine the likely value of A using Shapiro's method and assign the case to the corresponding subset.
- c: Treat this case as if it had the most common value of A.
- p: Assign the case to one of the subsets, with probability proportional to the number of cases with known values in each subset.
- f: Assign a fraction of this case to each subset, using the proportions above (Kononenko et al.).
- a: Include the training case in all subsets (Friedman).
- u: Develop a separate branch of the tree for cases with unknown values of attribute A.

When classifying a new case with an unknown value of a tested attribute A:
- u: If there is a special branch for unknown values of A, take it.
- s: Determine the most likely outcome of the test, as above, and act accordingly.
- c: Treat this case as if it had the most common value of A.
- f: Explore all branches, combining the results to reflect the relative probabilities of the different outcomes (Quinlan a).
- h: Halt at this point and assign the case to the most likely class.

Needless to say, not all of the possible combinations of these methods make sense.
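To make two of these letter codes concrete, here is a hedged sketch of how the 'r' evaluation rule (discounting gain by the known-value proportion) and the 'f' partitioning rule (splitting a case fractionally across subsets) could be implemented when each training case carries a weight. The function names reduced_gain and fractional_partition, the weighted-case representation, and the externally supplied information_gain helper are illustrative assumptions, not details of the paper's program.

```python
from collections import defaultdict

def reduced_gain(cases, weights, attr, information_gain):
    """Option 'r': discount the apparent gain of testing `attr` by the
    (weighted) proportion of cases whose value of `attr` is known.

    `cases` is a list of (attribute-dict, class) pairs, `weights` a parallel
    list of case weights; `information_gain` is any gain function evaluated
    over the known-valued cases and their weights.
    """
    known = [(c, w) for c, w in zip(cases, weights) if c[0][attr] is not None]
    total_weight = sum(weights)
    known_weight = sum(w for _, w in known)
    if total_weight == 0 or known_weight == 0:
        return 0.0
    gain = information_gain([c for c, _ in known], [w for _, w in known], attr)
    # If attr has an unknown rate of x, testing it yields no information a
    # fraction x of the time, so only (1 - x) of the gain is credited.
    return (known_weight / total_weight) * gain

def fractional_partition(cases, weights, attr):
    """Option 'f': split on `attr`, giving each case with an unknown value a
    fractional presence in every subset, in proportion to the weight of
    known-valued cases taking each outcome."""
    subsets = defaultdict(list)            # outcome -> list of (case, weight)
    outcome_weight = defaultdict(float)    # weight of known cases per outcome

    for case, w in zip(cases, weights):
        value = case[0][attr]
        if value is not None:
            subsets[value].append((case, w))
            outcome_weight[value] += w

    total_known = sum(outcome_weight.values())
    if total_known > 0:
        for case, w in zip(cases, weights):
            if case[0][attr] is None:
                for outcome, ow in outcome_weight.items():
                    subsets[outcome].append((case, w * ow / total_known))
    return dict(subsets)
```

Pairing these two rules with the probability-weighted classification strategy sketched at the end of the paper would give a package in the spirit of the rff combination examined below.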
UNKNOWN VALUES WHEN PARTITIONING

Seven packages that differ principally in their approach to partitioning were evaluated as follows. For each dataset, each of the ten training sets was used to construct trees whose error rates on the corresponding test set were measured. The means of the error rates on the test cases, and the standard errors of the sample means, are tabulated for each package and dataset.

An analysis of significant differences between packages brings out some interesting patterns. For each dataset, the results from each pair of packages were analysed to determine when one package was performing significantly better than the other. The results of these significance tests are summarised in a second table, which shows, for each package p, the number of packages significantly worse than p and the number significantly better than p on each dataset. The entries have been sorted in terms of a rough index of merit (the difference between these two counts), and reveal the very clear superiority of rff (assigning fractional cases to subsets) and the equally clear undesirability of rif (ignoring training cases with unknown values of the tested attribute) on these datasets.

UNKNOWN VALUES WHEN CLASSIFYING

A similar set of experiments was used to examine alternative approaches when a case to be classified has an unknown value of a tested attribute. Two trees were constructed from each training set, both using reduced gain for assessing tests, and using replacement by the Shapiro tree and by the most common value, respectively, when partitioning. The cases in the corresponding test set were then classified by both trees, using three different strategies on encountering an unknown value. For the test sets in each dataset, we used the one-tailed Student t-test on paired differences at a fixed confidence level.

[Table of results by dataset: LED orig, LED step, LED slope, chess endgame, primary tumor, sick euthyroid, auto insurance]
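The classification-time strategies can be sketched in the same style. The fragment below shows one way to realise the 'explore all branches' option ('f'): when a tested value is unknown, every branch is followed and the resulting class distributions are combined in proportion to the relative frequencies of the outcomes. The tree layout (leaves holding class counts, internal nodes holding per-outcome case weights) and the name classify_with_unknowns are assumptions of this sketch rather than details taken from the paper.

```python
from collections import Counter

def classify_with_unknowns(tree, case):
    """Return a class distribution for `case`, exploring every branch of the
    tree whenever the tested attribute's value is unknown (None or absent).

    Leaves are {"counts": Counter}; internal nodes are
    {"test": attr, "branches": {outcome: subtree}, "weights": {outcome: w}},
    where `weights` records how much training weight took each outcome.
    """
    if "counts" in tree:                      # leaf: class frequencies
        total = sum(tree["counts"].values())
        return {cls: n / total for cls, n in tree["counts"].items()}

    value = case.get(tree["test"])
    if value is not None and value in tree["branches"]:
        return classify_with_unknowns(tree["branches"][value], case)

    # Unknown value: explore all branches and combine the class
    # distributions, weighting each outcome by its relative frequency
    # among the training cases that reached this node.
    total_weight = sum(tree["weights"].values())
    combined = Counter()
    for outcome, subtree in tree["branches"].items():
        share = tree["weights"][outcome] / total_weight
        for cls, p in classify_with_unknowns(subtree, case).items():
            combined[cls] += share * p
    return dict(combined)

# The predicted class is the one with the largest combined probability, e.g.:
# predicted = max(classify_with_unknowns(tree, case).items(), key=lambda kv: kv[1])[0]
```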